#Loading packages and data
library(readtext)
library(quanteda)
library(stm)
library(dplyr)
library(stringr)
library(ggplot2)
We collected data on regional innovation programs for five EU member states: the Czech Republic, Germany, Poland, Portugal, and the UK. Regional programs are available for two time periods: before and after the introduction of conditionality in 2013.
Here we look at regional programs, usually drafted by NUTS2-level authorities, rather than more general national programs. The exception is the Czech Republic, which consists of NUTS1 regions only.
We perform the analysis country by country, starting with Germany below.
Reading in the data. These are PDFs of regional innovation programs that have been converted to plain text and then translated into English using EC JRC software.
DATA_DIR <- "~/Dropbox/Research/EC JRC project/Data/English texts/"
ec_files <- readtext(paste0(DATA_DIR, "/*"),
docvarsfrom = "filenames",
dvsep="_",
docvarnames = c("RCode", "Region", "Year", "Lang"))
ec_files$condition <- as.numeric(ec_files$Year > 2013)
ec_files$Country <- word(row.names(ec_files), 1, sep = fixed('/'))
ec_corpus <- corpus(ec_files, text_field = "text")
#subset of corpus per country, as automated translation is imperfect
#and picks up language-specific topics,
#so estimation is done country by country
de_corpus <- corpus_subset(ec_corpus, Country=="Germany")
pl_corpus <- corpus_subset(ec_corpus, Country=="Poland")
pt_corpus <- corpus_subset(ec_corpus, Country=="Portugal")
cz_corpus <- corpus_subset(ec_corpus, Country=="Czech Republic")
#tokenizing
tok <- tokens(de_corpus, what = "word",
removePunct = TRUE,
removeSymbols = TRUE,
removeNumbers = TRUE,
removeTwitter = TRUE,
removeURL = TRUE,
removeHyphens = TRUE,
verbose = TRUE)
Starting tokenization...
...tokenizing 1 of 1 blocks...
, removing URLs...hashing tokens
...total elapsed: 0.741 seconds.
Finished tokenizing and cleaning 36 texts.
dfm <- dfm(tok,
tolower = TRUE,
remove= c(stopwords("SMART"), "melt", "hessen", "hamburg", "schleswig", "holstein",
"bavaria", "bavarian", "berlin", "bremen", "saxoni", "northrhine", "westphalia",
"rhineland", "saarland", "mecklenburg", "wÿrttemberg", "baden", "anhalt",
"chsischen", "chsisch", "pomorski", "zachodniopomorski", "kujawsko", "łódź",
"podkarpacki", "mazovia", "mazowiecki", "pomerania", "świętokrzyski", "podlaski",
"warmia", "lubuski", "malopolska", "małopolsk", "wielkopolska", "sląskie",
"działao", "mazuri", "olsztyn", "łódzkie", "mazurskiego", "góra", "silesian",
"warsaw", "rzeszów", "mazowsz", "mazowieckich", "małopolska", "małopolski",
"śląskie", "kraków", "alentejo", "tejo", "lisboa", "azor", "algarv", "czech",
"republ", "tejo", "sociedad", "madeira", "lisbon", "saxoni", "bavaria", "sachsen"),
stem=TRUE,
verbose = TRUE)
Creating a dfm from a tokens object ...
... lowercasing
... found 36 documents, 56,220 features
... removed 490 features, from 632 supplied (glob) feature types
... stemming features (English), trimmed 6493 feature variants
... created a 36 x 49,237 sparse dfm
... complete.
Elapsed time: 0.822 seconds.
#Removing any digits. `dfm_select` picks up standalone digit features, not digits that are part of tokens.
dfm.m <- dfm_select(dfm, '[\\d-]', selection = "remove",
valuetype="regex", verbose = TRUE)
removed 2,875 features, from 1 supplied (regex) feature types
#Removing any punctuation. `dfm` picks up any punctuation unless it's part of a token.
dfm.m <- dfm_select(dfm.m, "[[:punct:]]", selection = "remove",
valuetype="regex", verbose = TRUE)
removed 1,042 features, from 1 supplied (regex) feature types
#Removing any tokens shorter than four characters.
dfm.m <- dfm_select(dfm.m, '^.{1,3}$', selection = "remove",
valuetype="regex", verbose = TRUE)
removed 1,963 features, from 1 supplied (regex) feature types
#Dropping words that appear fewer than 10 times or in fewer than 5 documents.
dfm.trim <- dfm_trim(dfm.m, min_count = 10, min_docfreq = 5)
Removing features occurring:
- fewer than 10 times: 39,146
- in fewer than 5 documents: 38,753
Total features removed: 39,762 (91.7%).
topfeatures(dfm.trim, n = 50)
innov develop research region technolog close strategi industri area
8149 7130 6206 4699 4694 2894 2836 2608 2506
support programm institut import project network cluster sector energi
2483 2414 2374 2328 2323 2281 2225 2221 2194
object product high busi econom compani scienc cooper increas
2162 2048 2016 2004 1987 1919 1918 1899 1898
countri economi work discuss knowledg servic market univers field
1896 1889 1872 1858 1736 1691 1668 1619 1611
involv competit structur process potenti fund unterstu polici implement
1603 1592 1588 1572 1566 1564 1501 1483 1474
germani employ educ result oper
1465 1427 1423 1413 1410
sparsity(dfm.trim)
[1] 0.5655617
nfeature(dfm.trim)
[1] 3595
Converting the DFM into the format required by the STM package.
stm.dfm <- convert(dfm.trim, to = "stm", docvars = docvars(de_corpus))
In the analysis we consider the effect of introducing conditionality on the content of innovation programs. To do so, we implement a structural topic model (Roberts et al., 2015). We model topic prevalence as a function of a conditionality indicator marking programs before and after 2013. In addition, we control for region fixed effects. The aim is to allow the observed metadata to affect the frequency with which a topic is discussed in innovation programs. This allows us to test the degree of association between conditionality (and region effects) and the average proportion of a document discussing a topic.
We assess the number of topics that needs to be specified for the STM analysis. We follow the original STM paper and focus on exclusivity and semantic coherence. Mimno et al. (2011) propose a semantic coherence measure, closely related to the pointwise mutual information measure of Newman et al. (2010), to evaluate topic quality. Mimno et al. (2011) show that semantic coherence corresponds to expert judgments and to more general human judgments in Amazon’s Mechanical Turk experiments.
The exclusivity score for each topic follows Bischof and Airoldi (2012). Highly frequent words in a given topic that do not appear too often in other topics make that topic exclusive. Cohesive and exclusive topics are more semantically useful. Following Roberts et al. (2015) we generate a set of candidate models ranging between 3 and 50 topics.
We then plot exclusivity against semantic coherence (numbers closer to zero indicate higher coherence) and select a model on the semantic coherence-exclusivity “frontier”, that is, among models that are not strictly dominated in both semantic coherence and exclusivity.
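The `search` object plotted below is not created in this excerpt. A minimal sketch of how it could have been produced with `stm::searchK`, assuming the prevalence specification with conditionality and region covariates described above:

```r
library(stm)

# Sketch (assumed): evaluate candidate models with 3 to 50 topics.
# `stm.dfm` is the converted DFM from above; extra arguments such as
# `prevalence` and `data` are passed through to stm().
search <- searchK(documents = stm.dfm$documents,
                  vocab = stm.dfm$vocab,
                  K = 3:50,
                  prevalence = ~ factor(condition) + factor(Region),
                  data = stm.dfm$meta)
```

The `search$results` data frame holds, among other diagnostics, the `exclus` and `semcoh` columns used in the plot below.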
#pdf("search.pdf")
par(mar=c(5,4,4,5)+.1)
plot(search$results$K,search$results$exclus,type="l",col="red",
xlab="Number of topics", ylab="Exclusivity")
axis(side=1,at=seq(0,50,5))
#abline(v=c(5,10), col="green")
par(new=TRUE)
plot(search$results$K, search$results$semcoh,
type="l",col="blue",xaxt="n",yaxt="n",xlab="",ylab="")
axis(4)
mtext("Semantic Coherence",side=4,line=3)
legend("right",col=c("red","blue"),lty=1,legend=c("excl","sem coh"))
#dev.off()
We select the model with 7 topics for our analysis – there’s a drop in semantic coherence after \(k=7\). We estimate the model with conditionality and region covariates.
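The fitted model object `topics7` used below is likewise not shown in this excerpt. A plausible sketch, assuming the conditionality and region prevalence covariates stated above:

```r
library(stm)

# Sketch (assumed): 7-topic structural topic model with conditionality
# and region fixed effects as prevalence covariates.
topics7 <- stm(documents = stm.dfm$documents,
               vocab = stm.dfm$vocab,
               K = 7,
               prevalence = ~ factor(condition) + factor(Region),
               data = stm.dfm$meta,
               init.type = "Spectral")
```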
One way to summarize topics is to combine a term’s frequency and its exclusivity to that topic into a univariate summary statistic. In the STM package this is implemented as FREX, following Bischof and Airoldi (2012) and Airoldi and Bischof (2016). The logic behind this measure is that both frequency and exclusivity matter in determining the semantic content of a word, and together they form a two-dimensional summary of topical content. FREX is the weighted harmonic mean of a word’s frequency and exclusivity ranks and can be viewed as a univariate measure of topical importance. The STM authors suggest that nonexclusive words are less likely to carry topic-specific content, while infrequent words occur too rarely to form the semantic core of a topic. FREX therefore combines information from the most frequent words in the corpus that are also likely to have been generated by the topic of interest. In practice, topic quality is usually evaluated by the highest probability words.
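For reference, the FREX score can be written (following the stm documentation, up to notation) as

\[
\mathrm{FREX}_{k,v} = \left( \frac{\omega}{\mathrm{ECDF}\!\left(\beta_{k,v} \Big/ \sum_{j=1}^{K} \beta_{j,v}\right)} + \frac{1-\omega}{\mathrm{ECDF}\!\left(\beta_{k,v}\right)} \right)^{-1},
\]

where \(\beta_{k,v}\) is the probability of word \(v\) under topic \(k\), the empirical CDFs are taken over words within topic \(k\), and the weight \(\omega\) (the `frexweight` argument of `labelTopics`, 0.5 by default) balances exclusivity against frequency.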
Here, we look both at FREX and highest probability words.
labelTopics(topics7)
Topic 1 Top Words:
Highest Prob: research, innov, develop, technolog, industri, work, institut
FREX: contract, cent, parti, contractor, academ, kšnnen, brandenburg
Lift: hšlfte, schÿler, kreativitšt, studiengšng, contractor, klaus, studiengšngen
Score: hšlfte, ÿber, kšnnen, mÿssen, fšrderung, universitšt, mobilitšt
Topic 2 Top Words:
Highest Prob: innov, research, develop, technolog, region, cluster, close
FREX: universita, maritim, schlu, octob, western, ndischen, juli
Lift: sselfunkt, masterpla, nstlich, hochschulischen, fteentwicklung, ndigt, innovationsunterstu
Score: sselfunkt, unterstu, universita, schlu, rken, fachkra, wertscho
Topic 3 Top Words:
Highest Prob: assess, instrument, task, import, technolog, network, unterstu
FREX: satisfi, reserv, nanotech, assess, input, benchmark, instrument
Lift: satisfi, umwelttech, nanotech, satisfact, reserv, instru, origina
Score: satisfi, unterstu, bergeordneten, umwelttech, nanotech, stribut, dafu
Topic 4 Top Words:
Highest Prob: develop, region, citi, programm, area, object, project
FREX: citi, ršumlich, port, wettbewerbsfšhigkeit, urban, beschšftigung, flšchen
Lift: flšchen, ršumlich, stšdte, stšdtischen, stšdten, prioritšt, residenti
Score: flšchen, ÿber, fšrderung, wettbewerbsfšhigkeit, ršumlich, beschšftigung, kšnnen
Topic 5 Top Words:
Highest Prob: innov, research, develop, technolog, region, strategi, institut
FREX: lower, saxoni, frankfurt, smart, projekt, nchen, darmstadt
Lift: spain, ernšhrungswirtschaft, schlÿsseltechnologien, komplementšr, fÿhrender, forschungsfšrderung, ausgrÿndungen
Score: spain, ÿber, fšrderung, kšnnen, stšrken, universitšt, stšrkung
Topic 6 Top Words:
Highest Prob: region, develop, programm, support, innov, oper, close
FREX: priorita, tsachs, ftigung, wettbewerbsfa, erdf, koha, tourism
Lift: auftra, verwaltungsbeho, rderbedarf, tsachs, dtischen, ndlicher, rderfa
Score: auftra, priorita, tsachs, wettbewerbsfa, ftigung, unterstu, bevo
Topic 7 Top Words:
Highest Prob: innov, research, develop, strategi, technolog, product, close
FREX: chsisch, mobilita, struggl, ment, document, saxoni, wertscho
Lift: niedersa, chsisch, tskonzept, passfa, struggl, strategisch, copi
Score: niedersa, unterstu, chsisch, mobilita, struggl, fachkra, rken
Plotting the same:
plot(topics7,type="labels", n = 15, text.cex = .6)
Here we have the expected proportions of the corpus that belong to each topic.
plot(topics7,type="summary", xlim = c(0, 1), n = 10, text.cex = .5)
Topics 5 and 7 appear to have very similar top words. We can plot the contrast in words across these two topics. This plot calculates the difference in probability of a word for the two topics, normalized by the maximum difference in probability of any word between the two topics.
plot.STM(topics7, type = "perspectives", topics = c(5,7))
We can also learn more about each topic with wordclouds. The first plot shows the marginal probabilities of words in the corpus; we plot the top 100 words.
cloud(topics7, topic = NULL, scale = c(3, .25), max.words = 100)
Wordcloud for Topic 1:
cloud(topics7, topic = 1, scale = c(3, .5), random.order = FALSE,rot.per = .3, max.words = 100)
Wordcloud for Topic 2:
cloud(topics7, topic = 2, scale = c(3, .5), random.order = FALSE,rot.per = .3, max.words = 100)
Wordcloud for Topic 3:
cloud(topics7, topic = 3, scale = c(3, .5), random.order = FALSE,rot.per = .3, max.words = 100)
Wordcloud for Topic 4:
cloud(topics7, topic = 4, scale = c(3, .5), random.order = FALSE,rot.per = .3, max.words = 100)
Wordcloud for Topic 5:
cloud(topics7, topic = 5, scale = c(3, .5), random.order = FALSE,rot.per = .3, max.words = 100)
Wordcloud for Topic 6:
cloud(topics7, topic = 6, scale = c(3, .5), random.order = FALSE,rot.per = .3, max.words = 100)
Wordcloud for Topic 7:
cloud(topics7, topic = 7, scale = c(3, .5), random.order = FALSE,rot.per = .3, max.words = 100)
We look at the relationship between topic proportions and the conditionality indicator. Confidence intervals produced by the method of composition in STM capture the statistical uncertainty from the linear regression model.
con.eff <- estimateEffect( ~ factor(condition),
topics7, meta = stm.dfm$meta,
uncertainty = "Global")
We now plot the results of the analysis as the difference in topic proportions for the two values of conditionality (before and after the introduction of the policy). Point estimates and 95% confidence intervals are plotted.
plot(con.eff, covariate = "condition",
model = topics7, method = "difference",
cov.value1 = 1, cov.value2 = 0, verbose.labels = FALSE, xlim = c(-.5, .7),
main = "Effect of Conditionality")
The figure shows the estimated treatment effect of conditionality introduction for all seven topics, comparing programs before and after the introduction of conditionality.
We observe that the introduction of conditionality had an effect only on Topic 2, while for the remaining topics the effect is not distinguishable from zero.
On average, the difference between the proportion of regional innovation programs that discuss Topic 2 after introduction of conditionality requirement and the proportion of programs that discuss Topic 2 before conditionality introduction is .38 (.16, .6).
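The point estimate and interval quoted above can be read off the effect regression. A sketch, using the `summary` method for `estimateEffect` objects:

```r
# Sketch: simulated regression summary for Topic 2; the coefficient on
# factor(condition)1 gives the estimated difference in topic proportions.
summary(con.eff, topics = 2)
```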
We can also assess the relationships between topics, since the STM framework allows topics to be correlated. A positive correlation between two topics suggests that both are likely to be covered within the same innovation program.
topic.cor <- topicCorr(topics7)
plot.topicCorr(topic.cor)
It appears that only Topics 1 and 5 are linked at the default 0.01 correlation cutoff.
#tokenizing
tok <- tokens(pl_corpus, what = "word",
removePunct = TRUE,
removeSymbols = TRUE,
removeNumbers = TRUE,
removeTwitter = TRUE,
removeURL = TRUE,
removeHyphens = TRUE,
verbose = TRUE)
Starting tokenization...
...tokenizing 1 of 1 blocks...
, removing URLs...hashing tokens
...total elapsed: 0.657000000000153 seconds.
Finished tokenizing and cleaning 28 texts.
dfm <- dfm(tok,
tolower = TRUE,
remove= c(stopwords("SMART"), "melt", "hessen", "hamburg", "schleswig", "holstein",
"bavaria", "bavarian", "berlin", "bremen", "saxoni", "northrhine", "westphalia",
"rhineland", "saarland", "mecklenburg", "wÿrttemberg", "baden", "anhalt",
"chsischen", "chsisch", "pomorski", "zachodniopomorski", "kujawsko", "łódź",
"podkarpacki", "mazovia", "mazowiecki", "pomerania", "świętokrzyski",
"podlaski",
"warmia", "lubuski", "malopolska", "małopolsk", "wielkopolska", "sląskie",
"działao", "mazuri", "olsztyn", "łódzkie", "mazurskiego", "góra", "silesian",
"warsaw", "rzeszów", "mazowsz", "mazowieckich", "małopolska", "małopolski",
"śląskie", "kraków", "alentejo", "tejo", "lisboa", "azor", "algarv", "czech",
"republ", "tejo", "sociedad", "madeira", "lisbon", "saxoni", "bavaria",
"sachsen"),
stem=TRUE,
verbose = TRUE)
Creating a dfm from a tokens object ...
... lowercasing
... found 28 documents, 22,192 features
... removed 494 features, from 632 supplied (glob) feature types
... stemming features (English), trimmed 4761 feature variants
... created a 28 x 16,937 sparse dfm
... complete.
Elapsed time: 0.407 seconds.
#Removing any digits. `dfm_select` picks up standalone digit features, not digits that are part of tokens.
dfm.m <- dfm_select(dfm, '[\\d-]', selection = "remove",
valuetype="regex", verbose = TRUE)
removed 400 features, from 1 supplied (regex) feature types
#Removing any punctuation. `dfm` picks up any punctuation unless it's part of a token.
dfm.m <- dfm_select(dfm.m, "[[:punct:]]", selection = "remove",
valuetype="regex", verbose = TRUE)
removed 301 features, from 1 supplied (regex) feature types
#Removing any tokens shorter than four characters.
dfm.m <- dfm_select(dfm.m, '^.{1,3}$', selection = "remove",
valuetype="regex", verbose = TRUE)
removed 1,358 features, from 1 supplied (regex) feature types
#Dropping words that appear fewer than 5 times or in fewer than 3 documents.
dfm.trim <- dfm_trim(dfm.m, min_count = 5, min_docfreq = 3)
Removing features occurring:
- fewer than 5 times: 11,008
- in fewer than 3 documents: 10,856
Total features removed: 11,555 (77.7%).
topfeatures(dfm.trim, n = 50)
region innov develop technolog support strategi busi research implement
15807 14102 10662 4813 4492 4306 4173 3940 3761
activ project area product institut industri econom specialis servic
3752 3526 3481 3423 3237 2934 2899 2888 2868
compani number cooper sector oper system fund process object
2825 2768 2766 2759 2705 2683 2682 2558 2548
european increas economi educ includ programm level poland centr
2548 2378 2309 2249 2247 2156 2141 2089 2080
enterpris inform market potenti public action invest year smart
2049 2011 1981 1974 1951 1937 1880 1823 1816
promot high nation base competit
1814 1812 1795 1747 1743
sparsity(dfm.trim)
[1] 0.5407119
nfeature(dfm.trim)
[1] 3323
Converting the DFM into the format required by the STM package.
stm.dfm <- convert(dfm.trim, to = "stm", docvars = docvars(pl_corpus))
As for Germany, we model topic prevalence as a function of the conditionality indicator and region fixed effects. We again generate candidate models with between 3 and 50 topics and select a model on the semantic coherence-exclusivity frontier (numbers closer to zero indicate higher coherence).
#pdf("search.pdf")
par(mar=c(5,4,4,5)+.1)
plot(search$results$K,search$results$exclus,type="l",col="red",
xlab="Number of topics", ylab="Exclusivity")
axis(side=1,at=seq(0,50,5))
#abline(v=c(5,10), col="green")
par(new=TRUE)
plot(search$results$K, search$results$semcoh,
type="l",col="blue",xaxt="n",yaxt="n",xlab="",ylab="")
axis(4)
mtext("Semantic Coherence",side=4,line=3)
legend("right",col=c("red","blue"),lty=1,legend=c("excl","sem coh"))
#dev.off()
We select the model with 6 topics for our analysis – there’s a drop in semantic coherence after \(k=6\). We estimate the model with conditionality and region covariates.
As above, we summarize topics using both FREX and the highest probability words.
labelTopics(topics6)
Topic 1 Top Words:
Highest Prob: innov, region, develop, strategi, technolog, support, implement
FREX: pomorski, firm, januari, smes, action, zachodniopomorski, version
Lift: instig, urgenc, kreatorem, exempt, boast, branżowo, consent
Score: urgenc, pomorski, innov, bydgoszcz, irop, region, irdop
Topic 2 Top Words:
Highest Prob: region, develop, specialis, innov, smart, strategi, manufactur
FREX: świętokrzyski, prospect, podlaski, updat, manufactur, drawn, cent
Lift: shore, metalowo, venu, error, mail, percent, super
Score: shore, pomorski, smart, świętokrzyski, kielc, podlaski, specialis
Topic 3 Top Words:
Highest Prob: region, develop, area, social, econom, support, increas
FREX: mazuri, urban, municip, citi, mazowsz, rural, central
Lift: bozena, bzura, habitat, lagoon, pilica, rozwiązao, łódz
Score: rozwiązao, mazuri, bovin, jakośd, dostępnośd, społeczeostwo, wartośd
Topic 4 Top Words:
Highest Prob: innov, region, develop, technolog, busi, research, support
FREX: lubuski, lubelski, quantiti, egionalna, lublin, smes, scheme
Lift: bytom, cowników, darczej, doradczej, finansowa, gionalnego, gliwic
Score: gliwic, lubuski, egionalna, kielc, przedsiębior, świętokrzyski, irop
Topic 5 Top Words:
Highest Prob: innov, region, develop, specialis, area, smart, technolog
FREX: podkarpacki, wielkopolski, chain, discoveri, food, specialis, smart
Lift: aerospac, counteract, cosm, bioeconomi, reproduc, furnish, specjalizacji
Score: aerospac, smart, bioeconomi, podkarpacki, specialis, ibid, foray
Topic 6 Top Words:
Highest Prob: region, innov, develop, project, technolog, research, busi
FREX: małopolski, west, realis, pomeranian, compet, financ, project
Lift: meta, inextric, prosum, westpomeranian, małopolski, jagiellonian, dispers
Score: meta, pomeranian, małopolski, prosum, realis, smart, ecosystem
Plotting the same:
plot(topics6,type="labels", n = 10, text.cex = .5)
Here we have the expected proportions of the corpus that belong to each topic.
plot(topics6,type="summary", xlim = c(0, 1), n = 10, text.cex = .5)
Topics 4 and 1 appear to have very similar top words. We can plot the contrast in words across these two topics. This plot calculates the difference in probability of a word for the two topics, normalized by the maximum difference in probability of any word between the two topics.
plot.STM(topics6, type = "perspectives", topics = c(4,1))
We can also learn more about each topic with wordclouds. The first plot shows the marginal probabilities of words in the corpus; we plot the top 100 words.
cloud(topics6, topic = NULL, scale = c(2, .25), max.words = 100)
Wordcloud for Topic 1:
cloud(topics6, topic = 1, scale = c(4, .5), random.order = FALSE,rot.per = .3, max.words = 100)
Wordcloud for Topic 2:
cloud(topics6, topic = 2, scale = c(3, .5), random.order = FALSE,rot.per = .3, max.words = 100)
Wordcloud for Topic 3:
cloud(topics6, topic = 3, scale = c(4, .5), random.order = FALSE,rot.per = .3, max.words = 100)
Wordcloud for Topic 4:
cloud(topics6, topic = 4, scale = c(4, .5), random.order = FALSE,rot.per = .3, max.words = 100)
Wordcloud for Topic 5:
cloud(topics6, topic = 5, scale = c(3, .5), random.order = FALSE,rot.per = .3, max.words = 100)
Wordcloud for Topic 6:
cloud(topics6, topic = 6, scale = c(3, .5), random.order = FALSE,rot.per = .3, max.words = 100)
We look at the relationship between topic proportions and the conditionality indicator. Confidence intervals produced by the method of composition in STM capture the statistical uncertainty from the linear regression model.
con.eff <- estimateEffect( ~ factor(condition),
topics6, meta = stm.dfm$meta,
uncertainty = "Global")
We now plot the results of the analysis as the difference in topic proportions for the two values of conditionality (before and after the introduction of the policy). Point estimates and 95% confidence intervals are plotted.
plot(con.eff, covariate = "condition",
model = topics6, method = "difference",
cov.value1 = 1, cov.value2 = 0, verbose.labels = FALSE, xlim = c(-1, 1),
main = "Effect of Conditionality")
The figure shows the estimated treatment effect of conditionality introduction for all six topics, comparing programs before and after the introduction of conditionality.
We observe that the introduction of conditionality had an effect on Topics 2, 4, and 5, while for the remaining topics the effect is not distinguishable from zero. Conditionality increased the use of Topics 2 and 5 and decreased the use of Topic 4.
On average, the difference between the proportion of regional innovation programs discussing Topics 2 and 5 after the introduction of the conditionality requirement and the proportion discussing them before is about 0.4. There is a decrease of a similar magnitude in the discussion of Topic 4.
We again assess the correlations between topics.
topic.cor <- topicCorr(topics6)
plot.topicCorr(topic.cor)
It appears that none of the topics are linked at the default 0.01 correlation cutoff.
#tokenizing
tok <- tokens(pt_corpus, what = "word",
removePunct = TRUE,
removeSymbols = TRUE,
removeNumbers = TRUE,
removeTwitter = TRUE,
removeURL = TRUE,
removeHyphens = TRUE,
verbose = TRUE)
Starting tokenization...
...tokenizing 1 of 1 blocks...
, removing URLs...hashing tokens
...total elapsed: 0.513999999999996 seconds.
Finished tokenizing and cleaning 16 texts.
dfm <- dfm(tok,
tolower = TRUE,
remove= c(stopwords("SMART"), "melt", "hessen", "hamburg", "schleswig", "holstein",
"bavaria", "bavarian", "berlin", "bremen", "saxoni", "northrhine", "westphalia",
"rhineland", "saarland", "mecklenburg", "wÿrttemberg", "baden", "anhalt",
"chsischen", "chsisch", "pomorski", "zachodniopomorski", "kujawsko", "łódź",
"podkarpacki", "mazovia", "mazowiecki", "pomerania", "świętokrzyski",
"podlaski",
"warmia", "lubuski", "malopolska", "małopolsk", "wielkopolska", "sląskie",
"działao", "mazuri", "olsztyn", "łódzkie", "mazurskiego", "góra", "silesian",
"warsaw", "rzeszów", "mazowsz", "mazowieckich", "małopolska", "małopolski",
"śląskie", "kraków", "alentejo", "tejo", "lisboa", "azor", "algarv", "czech",
"republ", "tejo", "sociedad", "madeira", "lisbon", "saxoni", "bavaria",
"sachsen"),
stem=TRUE,
verbose = TRUE)
Creating a dfm from a tokens object ...
... lowercasing
... found 16 documents, 17,383 features
... removed 456 features, from 632 supplied (glob) feature types
... stemming features (English), trimmed 4908 feature variants
... created a 16 x 12,019 sparse dfm
... complete.
Elapsed time: 0.374 seconds.
#Removing any digits. `dfm_select` picks up standalone digit features, not digits that are part of tokens.
dfm.m <- dfm_select(dfm, '[\\d-]', selection = "remove",
valuetype="regex", verbose = TRUE)
removed 317 features, from 1 supplied (regex) feature types
#Removing any punctuation. `dfm` picks up any punctuation unless it's part of a token.
dfm.m <- dfm_select(dfm.m, "[[:punct:]]", selection = "remove",
valuetype="regex", verbose = TRUE)
removed 233 features, from 1 supplied (regex) feature types
#Removing any tokens shorter than four characters.
dfm.m <- dfm_select(dfm.m, '^.{1,3}$', selection = "remove",
valuetype="regex", verbose = TRUE)
removed 1,139 features, from 1 supplied (regex) feature types
#Dropping words that appear fewer than 5 times or in fewer than 3 documents.
dfm.trim <- dfm_trim(dfm.m, min_count = 5, min_docfreq = 3)
Removing features occurring:
- fewer than 5 times: 6,827
- in fewer than 3 documents: 6,915
Total features removed: 7,307 (70.7%).
topfeatures(dfm.trim, n = 50)
region develop area innov product technolog activ sector promot
9811 4327 3789 3494 3056 3044 2691 2387 2224
strategi system support nation project level resourc busi research
2176 2038 1981 1967 1814 1808 1783 1762 1697
tourism polici econom specialis public servic programm industri centr
1678 1657 1654 1642 1628 1527 1480 1430 1414
competit plan market manag structur smart territori prioriti action
1365 1363 1332 1320 1318 1316 1313 1238 1216
strateg knowledg figur implement intern increas potenti portug network
1208 1170 1166 1160 1159 1147 1140 1125 1112
process european invest oper relat
1109 1095 1084 1080 1070
sparsity(dfm.trim)
[1] 0.4473412
nfeature(dfm.trim)
[1] 3023
Converting the DFM into the format required by the STM package.
stm.dfm <- convert(dfm.trim, to = "stm", docvars = docvars(pt_corpus))
As for Germany and Poland, we model topic prevalence as a function of the conditionality indicator and region fixed effects. We again generate candidate models with between 3 and 50 topics and select a model on the semantic coherence-exclusivity frontier.
par(mar=c(5,4,4,5)+.1)
plot(search$results$K,search$results$exclus,type="l",col="red",
xlab="Number of topics", ylab="Exclusivity")
axis(side=1,at=seq(0,50,5))
#abline(v=c(5,10), col="green")
par(new=TRUE)
plot(search$results$K, search$results$semcoh,
type="l",col="blue",xaxt="n",yaxt="n",xlab="",ylab="")
axis(4)
mtext("Semantic Coherence",side=4,line=3)
legend("right",col=c("red","blue"),lty=1,legend=c("excl","sem coh"))
We select the model with 10 topics for our analysis – there’s no improvement in semantic coherence after \(k=10\). We estimate the model with conditionality and region covariates.
As above, we summarize topics using both FREX and the highest probability words.
labelTopics(topics10)
Topic 1 Top Words:
Highest Prob: region, project, programm, technolog, action, support, innov
FREX: vale, approv, report, final, submit, project, steer
Lift: ritt, prai, tagusvalley, ourém, oest, reprogram, iasp
Score: ritt, tagusvalley, project, vale, prai, programm, reprogram
Topic 2 Top Words:
Highest Prob: region, develop, specialis, smart, product, activ, technolog
FREX: northern, ration, smart, specialis, comput, asset, textil
Lift: coat, tractor, symbol, valor, cumul, grade, spice
Score: coat, smart, specialis, northern, manufactur, cloth, textil
Topic 3 Top Words:
Highest Prob: programm, innov, fund, project, region, oper, financi
FREX: regul, erdf, articl, elig, expenditur, member, financi
Lift: merg, debt, hear, statutori, oblig, projecto, multiannu
Score: merg, projecto, erdf, gender, debt, hear, articl
Topic 4 Top Words:
Highest Prob: area, region, urban, territori, develop, system, promot
FREX: prot, urban, beira, serra, mondego, road, corridor
Lift: serra, prohibit, prot, mondego, alta, corridor, pinhal
Score: serra, mondego, inst, beira, leiria, pinhal, viseu
Topic 5 Top Words:
Highest Prob: region, centr, innov, central, portug, product, area
FREX: central, portug, centr, materi, tice, manufactur, forest
Lift: pulp, quotient, politécnico, biocant, fraqueza, ppas, toxicolog
Score: pulp, tice, smart, central, centr, quotient, biocant
Topic 6 Top Words:
Highest Prob: region, azor, develop, tourism, area, activ, innov
FREX: azor, autonom, island, livestock, fisheri, govern, fish
Lift: azor, island, archipelago, campus, galician, dive, canari
Score: azor, smart, island, autonom, tourism, campus, outermost
Topic 7 Top Words:
Highest Prob: region, innov, area, develop, technolog, strategi, promot
FREX: version, decemb, polici, smart, climat, internationalis, align
Lift: listen, repositori, postgradu, subsoil, ipctn, esif, dgeec
Score: listen, smart, specialis, version, decemb, esif, satellit
Topic 8 Top Words:
Highest Prob: region, sector, product, innov, technolog, project, level
FREX: augusto, mateus, advis, sociedad, drawn, team, graph
Lift: augusto, mateus, sociedad, decor, graph, advis, mediat
Score: augusto, mateus, advis, sociedad, graph, cork, aeronaut
Topic 9 Top Words:
Highest Prob: region, algarv, activ, sector, innov, product, develop
FREX: algarv, faro, ccdr, season, unemploy, aquacultur, accommod
Lift: motorcycl, olhão, tavira, algarv, footnot, faro, iefp
Score: motorcycl, algarv, faro, smart, tourism, aquacultur, albufeira
Topic 10 Top Words:
Highest Prob: region, develop, nation, public, territori, oper, polici
FREX: nsrf, north, converg, cohes, portugues, page, scenario
Lift: tecno, braga, offset, infor, page, gothenburg, manoeuvr
Score: tecno, nsrf, north, page, territori, douro, gender
Plotting the same:
plot(topics10,type="labels", n = 10, text.cex = .5)
Here we have the expected proportion of the corpus that belongs to each topic.
plot(topics10,type="summary", xlim = c(0, 1), n = 10, text.cex = .5)
Topics 7 and 1 appear to have very similar top words. We can plot the contrast in words across these two topics. This plot calculates the difference in probability of a word for the two topics, normalized by the maximum difference in probability of any word between the two topics.
plot.STM(topics10, type = "perspectives", topics = c(7,1))
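The normalization behind the perspectives plot can be illustrated with hypothetical word probabilities for the two topics:

```r
# Hypothetical P(word | topic) vectors for Topics 7 and 1.
p7 <- c(innov = 0.30, region = 0.25, version = 0.20, project = 0.25)
p1 <- c(innov = 0.10, region = 0.25, version = 0.05, project = 0.60)

# Difference in word probability, scaled by the largest absolute
# difference across the vocabulary, so values lie in [-1, 1]:
# positive values lean toward Topic 7, negative toward Topic 1.
scaled <- (p7 - p1) / max(abs(p7 - p1))
round(scaled, 2)
```

In this toy vocabulary "project" sits at the Topic 1 extreme (-1), "region" at zero because both topics use it equally.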
We can also learn more about each topic with wordclouds. The first plot shows marginal probability of words in the corpus. We are plotting top 100 words.
cloud(topics10, topic = NULL, scale = c(2, .25), max.words = 100)
Wordcloud for Topic 1:
cloud(topics10, topic = 1, scale = c(3, .5), random.order = FALSE,rot.per = .3, max.words = 100)
Wordcloud for Topic 2:
cloud(topics10, topic = 2, scale = c(3, .5), random.order = FALSE,rot.per = .3, max.words = 100)
Wordcloud for Topic 3:
cloud(topics10, topic = 3, scale = c(3, .5), random.order = FALSE,rot.per = .3, max.words = 100)
Wordcloud for Topic 4:
cloud(topics10, topic = 4, scale = c(3, .5), random.order = FALSE,rot.per = .3, max.words = 100)
Wordcloud for Topic 5:
cloud(topics10, topic = 5, scale = c(3, .5), random.order = FALSE,rot.per = .3, max.words = 100)
Wordcloud for Topic 6:
cloud(topics10, topic = 6, scale = c(3, .5), random.order = FALSE,rot.per = .3, max.words = 100)
Wordcloud for Topic 7:
cloud(topics10, topic = 7, scale = c(3, .1), random.order = FALSE,rot.per = .3, max.words = 100)
Wordcloud for Topic 8:
cloud(topics10, topic = 8, scale = c(3, .3), random.order = FALSE,rot.per = .3, max.words = 100)
Wordcloud for Topic 9:
cloud(topics10, topic = 9, scale = c(3, .5), random.order = FALSE,rot.per = .3, max.words = 100)
Wordcloud for Topic 10:
cloud(topics10, topic = 10, scale = c(3, .5), random.order = FALSE,rot.per = .3, max.words = 100)
We look at the relationship between topic proportions and the conditionality factor. Confidence intervals produced by the method of composition in STM allow us to carry the uncertainty in the estimated topic proportions into the linear regression model.
con.eff <- estimateEffect( ~ factor(condition),
topics10, meta = stm.dfm$meta,
uncertainty = "Global")
We now plot the results of the analysis as the difference in topic proportions between the two values of conditionality (after minus before the introduction of the policy). Point estimates and 95% confidence intervals are plotted.
plot(con.eff, covariate = "condition",
model = topics10, method = "difference",
cov.value1 = 1, cov.value2 = 0, verbose.labels = FALSE, xlim = c(-1, 1),
ci.level=.95,
main = "Effect of Conditionality")
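The "difference" contrast can be approximated in base R with hypothetical topic proportions; the real estimateEffect additionally propagates uncertainty in the estimated proportions via the method of composition.

```r
# Hypothetical Topic 7 proportions for programs after and before 2013.
after  <- c(0.55, 0.60, 0.50, 0.65)
before <- c(0.15, 0.20, 0.10, 0.25)

# Point estimate: difference in mean topic proportion, with a
# normal-approximation 95% confidence interval.
est <- mean(after) - mean(before)
se  <- sqrt(var(after) / length(after) + var(before) / length(before))
ci  <- est + c(-1.96, 1.96) * se
c(estimate = est, lower = ci[1], upper = ci[2])
```

An interval that excludes zero, as here, corresponds to the topics whose error bars clear the vertical zero line in the plot above.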
The figure shows the estimated effect of introducing conditionality for all ten topics, comparing programs before and after its introduction.
The effect is distinguishable from zero only for Topic 7; for the remaining topics it is not.
On average, the expected proportion of a regional innovation program devoted to Topic 7 is 0.4 higher after the introduction of the conditionality requirement than before it.
We can also assess the relationships between topics: the STM framework allows topics to be correlated, and a positive correlation suggests that the two topics are likely to be covered within the same innovation program.
topic.cor <- topicCorr(topics10)
plot.topicCorr(topic.cor)
It appears that Topics 3 and 10 are linked at the default 0.01 correlation cutoff.
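The linking rule (topicCorr’s default cutoff of 0.01 on positive correlations) can be illustrated on a toy correlation matrix:

```r
# Hypothetical correlation matrix for three topics.
cors <- matrix(c(1.000, 0.020, -0.100,
                 0.020, 1.000,  0.005,
                -0.100, 0.005,  1.000), nrow = 3, byrow = TRUE)

# Topics i and j are linked when their correlation exceeds the cutoff;
# the diagonal (a topic with itself) is excluded.
adj <- (cors > 0.01) & !diag(nrow(cors))
which(adj, arr.ind = TRUE)  # only the pair (1, 2) is linked
```

Here Topics 1 and 2 are linked (correlation 0.02 > 0.01), while the weak 0.005 correlation between Topics 2 and 3 falls below the cutoff, mirroring how only Topics 3 and 10 are linked in our plot.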
#tokenizing
tok <- tokens(cz_corpus, what = "word",
removePunct = TRUE,
removeSymbols = TRUE,
removeNumbers = TRUE,
removeTwitter = TRUE,
removeURL = TRUE,
removeHyphens = TRUE,
verbose = TRUE)
Starting tokenization...
...tokenizing 1 of 1 blocks...
, removing URLs...hashing tokens
...total elapsed: 0.153 seconds.
Finished tokenizing and cleaning 3 texts.
dfm <- dfm(tok,
tolower = TRUE,
remove= c(stopwords("SMART"), "melt", "hessen", "hamburg", "schleswig", "holstein",
"bavaria", "bavarian", "berlin", "bremen", "saxoni", "northrhine", "westphalia",
"rhineland", "saarland", "mecklenburg", "wÿrttemberg", "baden", "anhalt",
"chsischen", "chsisch", "pomorski", "zachodniopomorski", "kujawsko", "łódź",
"podkarpacki", "mazovia", "mazowiecki", "pomerania", "świętokrzyski",
"podlaski", "warmia", "lubuski", "malopolska", "małopolsk", "wielkopolska",
"sląskie", " republ",
"działao", "mazuri", "olsztyn", "łódzkie", "mazurskiego", "góra", "silesian",
"warsaw", "rzeszów", "mazowsz", "mazowieckich", "małopolska", "małopolski",
"śląskie", "kraków", "alentejo", "tejo", "lisboa", "azor", "algarv", "czech",
"republ", "tejo", "sociedad", "madeira", "lisbon", "saxoni", "bavaria",
"sachsen"),
stem=TRUE,
verbose = TRUE)
Creating a dfm from a tokens object ...
... lowercasing
... found 3 documents, 7,068 features
... removed 401 features, from 633 supplied (glob) feature types
... stemming features (English), trimmed 2318 feature variants
... created a 3 x 4,349 sparse dfm
... complete.
Elapsed time: 0.264 seconds.
#Removing any digits: `dfm` picks up free-standing digits, but not digits that are part of tokens.
dfm.m <- dfm_select(dfm, '[\\d-]', selection = "remove",
valuetype="regex", verbose = TRUE)
removed 103 features, from 1 supplied (regex) feature types
#Removing any punctuation. `dfm` picks up any punctuation unless it's part of a token.
dfm.m <- dfm_select(dfm.m, "[[:punct:]]", selection = "remove",
valuetype="regex", verbose = TRUE)
removed 35 features, from 1 supplied (regex) feature types
#Removing any tokens shorter than four characters.
dfm.m <- dfm_select(dfm.m, '^.{1,3}$', selection = "remove",
valuetype="regex", verbose = TRUE)
removed 364 features, from 1 supplied (regex) feature types
#Dropping words that appear fewer than 2 times or in fewer than 2 documents.
dfm.trim <- dfm_trim(dfm.m, min_count = 2, min_docfreq = 2)
Removing features occurring:
- fewer than 2 times: 1,449
- in fewer than 2 documents: 2,138
Total features removed: 2,138 (55.6%).
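The trimming rule can be reproduced on a toy document-term matrix in base R (the counts and terms below are made up):

```r
# Tiny document-term matrix: 3 documents, 4 terms.
dtm <- matrix(c(3, 0, 1, 2,
                0, 1, 0, 4,
                2, 0, 0, 3), nrow = 3, byrow = TRUE,
              dimnames = list(NULL, c("research", "rare", "once", "innov")))

# Keep terms with total count >= 2 that appear in >= 2 documents,
# mirroring dfm_trim(min_count = 2, min_docfreq = 2).
keep    <- colSums(dtm) >= 2 & colSums(dtm > 0) >= 2
trimmed <- dtm[, keep, drop = FALSE]
colnames(trimmed)  # "research" "innov"
```

"rare" fails the count threshold and "once" fails both; terms that clear both thresholds survive, as in the quanteda output above.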
topfeatures(dfm.trim, n = 50)
research republ develop innov public nation support
1441 1066 1053 928 613 544 520
econom increas level activ system busi strategi
488 459 459 452 451 435 431
region programm growth sector area implement knowledg
422 418 411 403 393 370 363
competit firm result fund qualiti servic intern
351 338 334 330 329 327 323
technolog countri educ manag market resourc state
321 317 316 313 303 296 285
product polici number organis economi industri term
285 281 276 273 271 265 261
object specialis foreign project invest high european
257 256 254 253 252 251 236
infrastructur
231
sparsity(dfm.trim)
[1] 0.1677394
nfeature(dfm.trim)
[1] 1709
Converting DFM into format for STM package
stm.dfm <- convert(dfm.trim, to = "stm", docvars = docvars(cz_corpus))
In this analysis we consider the effect of introducing conditionality on the content of innovation programs. To do so, we implement a structural topic model (Roberts et al., 2015), modelling topic prevalence as a function of the conditionality indicator that marks programs before and after 2013. In contrast to the analysis above, we do not control for region fixed effects, as we have only Czech national documents. The aim is to allow the observed metadata to affect the frequency with which a topic is discussed in innovation programs, which lets us test the degree of association between conditionality and the average proportion of a document discussing a topic.
Given that we have only three documents for the Czech Republic, we cannot perform the topic search procedure implemented above. For comparability with the previous analyses, we settle on a similar number of topics (six).
As before, we summarize each topic with FREX, which combines a word’s frequency within a topic and its exclusivity to that topic into a single measure of topical importance.
labelTopics(topics6)
Topic 1 Top Words:
Highest Prob: research, develop, innov, support, republ, activ, public
FREX: earmark, support, prioriti, research, evalu, societi, activ
Lift: earmark, foresight, compos, under, opei, specialist, onward
Score: earmark, vavpl, research, commercialis, aspect, excel, opei
Topic 2 Top Words:
Highest Prob: econom, public, growth, strategi, develop, legisl, republ
FREX: pass, languag, final, output, insur, land, transport
Lift: unemploy, pass, taxat, land, court, privatis, fiscal
Score: unemploy, languag, pass, insur, case, privatis, final
Topic 3 Top Words:
Highest Prob: research, republ, innov, develop, nation, firm, region
FREX: intervent, specialis, domain, digit, manufactur, firm, talent
Lift: domain, vertic, intervent, specialis, experiment, talent, evid
Score: domain, intervent, case, manufactur, digit, tabl, specialis
Topic 4 Top Words:
Highest Prob: research, develop, innov, republ, support, activ, public
FREX: research, innov, develop, activ, knowledg, object, organis
Lift: defenc, vavpl, pursuit, mediatis, nerv, news, panel
Score: defenc, research, innov, develop, vavpl, excel, commercialis
Topic 5 Top Words:
Highest Prob: research, develop, innov, support, liabil, republ, public
FREX: liabil, support, prioriti, evalu, polici, develop, activ
Lift: liabil, scarc, abovement, voucher, ecop, announc, white
Score: liabil, research, support, develop, innov, prioriti, activ
Topic 6 Top Words:
Highest Prob: research, develop, innov, support, republ, deadlin, public
FREX: deadlin, support, prioriti, polici, evalu, develop, establish
Lift: deadlin, scarc, abovement, voucher, announc, august, buoyant
Score: deadlin, research, support, develop, innov, prioriti, activ
Plotting the same:
plot(topics6,type="labels", n = 10, text.cex = .5)
Here we have the expected proportion of the corpus that belongs to each topic.
plot(topics6,type="summary", xlim = c(0, 1), n = 10, text.cex = .5)
Topics 2 and 3 appear to have very similar top words. We can plot the contrast in words across these two topics. This plot calculates the difference in probability of a word for the two topics, normalized by the maximum difference in probability of any word between the two topics.
plot.STM(topics6, type = "perspectives", topics = c(2,3))
We can also learn more about each topic with wordclouds. The first plot shows marginal probability of words in the corpus. We are plotting top 100 words.
cloud(topics6, topic = NULL, scale = c(2, .25), max.words = 100)
Wordcloud for Topic 1:
cloud(topics6, topic = 1, scale = c(3, .5), random.order = FALSE,rot.per = .3, max.words = 100)
Wordcloud for Topic 2:
cloud(topics6, topic = 2, scale = c(3, .3), random.order = FALSE,rot.per = .3, max.words = 100)
Wordcloud for Topic 3:
cloud(topics6, topic = 3, scale = c(3, .5), random.order = FALSE,rot.per = .3, max.words = 100)
Wordcloud for Topic 4:
cloud(topics6, topic = 4, scale = c(3, .5), random.order = FALSE,rot.per = .3, max.words = 100)
Wordcloud for Topic 5:
cloud(topics6, topic = 5, scale = c(3, .5), random.order = FALSE,rot.per = .3, max.words = 100)
Wordcloud for Topic 6:
cloud(topics6, topic = 6, scale = c(3, .5), random.order = FALSE,rot.per = .3, max.words = 100)
We look at the relationship between topic proportions and the conditionality factor. Confidence intervals produced by the method of composition in STM allow us to carry the uncertainty in the estimated topic proportions into the linear regression model.
con.eff <- estimateEffect( ~ factor(condition),
topics6, meta = stm.dfm$meta,
uncertainty = "Global")
We now plot the results of the analysis as the difference in topic proportions between the two values of conditionality (after minus before the introduction of the policy). Point estimates and 95% confidence intervals are plotted.
plot(con.eff, covariate = "condition",
model = topics6, method = "difference",
cov.value1 = 1, cov.value2 = 0, verbose.labels = FALSE, xlim = c(-1, 1),
ci.level=.95,
main = "Effect of Conditionality")
The figure shows the estimated effect of introducing conditionality for all six topics, comparing programs before and after its introduction.
The effect is distinguishable from zero only for Topic 3; for the remaining topics it is not.
On average, the expected proportion of an innovation program devoted to Topic 3 is 0.8 higher after the introduction of the conditionality requirement than before it.
We can also assess the relationships between topics: the STM framework allows topics to be correlated, and a positive correlation suggests that the two topics are likely to be covered within the same innovation program.
topic.cor <- topicCorr(topics6)
plot.topicCorr(topic.cor)
It appears that all topics apart from Topic 2 are linked at the default 0.01 correlation cutoff.